[57] ABSTRACT

A multimedia file and method for forming the same organize 
instances of multimedia information according to media 
information type (e.g., audio, video, MIDI, etc.), encoding 
format, media subtype, and encoding rate. Several instances 
of the same media information type are included, each of 
such instances having a different encoding format, media 
subtype, and/or encoding rate. A presentation application 
utilizes the subject multimedia file to identify, select, and 
present specific instances of the multimedia information, 
permitting the presentation application to customize a mul- 
timedia presentation based on, among other things, the rate 
of the connection from the presentation application to the 
presentation consumer and the decoding capabilities of the 
presentation consumer. 
MULTIMEDIA FILE, SUPPORTING
MULTIPLE INSTANCES OF MEDIA TYPES, 
AND METHOD FOR FORMING SAME 

CROSS-REFERENCE TO RELATED
APPLICATIONS 

This application is related to the following U.S. patent 
applications, all of which are assigned to the assignee of this 
application and all of which are incorporated by reference 
herein:
Device, System And Method Of Real-Time Multimedia
Streaming (Ser. No. 08/636,417, to Qin-Fan Zhu,
Manickam R. Sridhar, and M. Vedat Eyuboglu, filed Apr.
23, 1996, now U.S. Pat. No. 5,768,527 (issued Jun. 16,
1998));

Improved Video Encoding System And Method (Ser. No.
08/711,702, to Manickam R. Sridhar and Feng Chi
Wang, filed on Sep. 6, 1996, pending); and

System, Device, And Method For Streaming A Multimedia
File Method (Ser. No. 08/711,701, to Manickam R.
Sridhar, Tom Goetz and Mukesh Prasad, filed on Sep.
6, 1996 herewith, pending).

BACKGROUND

1. Field of the Invention 

The invention generally relates to real-time multimedia 
applications and, more particularly, to the streaming of 
real-time multimedia information over a communication 
network.

2. Discussion of Related Art 

Generally speaking, multimedia applications present
related media information, such as video, audio, music, etc.,
on a presentation device, such as a computer having a
display and sound system. Some multimedia applications
are highly interactive, whereas other applications are far less
interactive. For example, a game is a highly interactive
application in which the application must respond to many
user inputs such as keyboard commands and joystick
movements, whereas viewing a video clip is less interactive
and may only involve start and stop commands. Moreover,
multimedia applications may be directed to standalone
single computer contexts, or they may be directed to
distributed, network-based contexts.

At a very high level of abstraction, following the 
producer-consumer paradigm, any multimedia application 
involves producing and consuming the related multimedia 
information. The above examples of highly interactive and 
less interactive and standalone and network-based
applications differ in the manner in which the information is
produced and consumed and the complexity in controlling 
the production and consumption. 

For example, in PCs and other standalone contexts, the 
information need only be read off a local CD ROM or the
like, and thus, producing the information to the consumer 
involves relatively predictable characteristics and requires 
relatively simple control logic. In network-based contexts, 
on the other hand, the information must be produced over 
the network and is thus subject to the unpredictable
characteristics and intricacies of the network: data may be lost,
performance may vary over time, and so on. Consequently, 
the control logic may need to be relatively complicated. 

In either the standalone or network-based context,
consuming the information involves presenting the related
information to the corresponding presentation components
in a controlled manner and in real time. For example, to
provide intelligible audio-video clips, the video data must be
provided to a video driver and audio data must be provided 
to a sound card driver within specified timing tolerances to 
maintain intra- and inter-stream synchronism. Intra-stream 
synchronism means that a given stream, such as audio, is 
presented in synchronism within specified time 
relationships, in short, that the stream itself is coherent. 
Inter-stream synchronism means that multiple related 
streams are presented in synchronism with respect to each 
other. Concerning intra-stream synchronism, all streams, for 
the most part, should present data in order. However, users 
are more forgiving if some streams, such as video, leave out 
certain portions of the data than they are of other streams, 
such as audio, doing the same. A video stream with missing 
data may appear a little choppy, but an audio stream with 
missing data may be completely unintelligible. Concerning 
inter-stream synchronism, poor control will likely result in 
poor "lip synch," making the presentation appear and sound 
like a poorly dubbed movie. 

In the network-based context, one simple model of
producing the information involves the consuming entity
requesting the download of the multimedia information for
an entire presentation from a server that stores the multimedia
information. Once the download completes, the client may then
consume, or present, the information. Although relatively
simple to implement, this model has the disadvantage of 
requiring the user to wait for the downloading to complete 
before the presentation can begin. This delay can be con- 
siderable and is especially annoying when a user finds that 
he or she is only interested in a small portion of the overall 
presentation. 

A more sophisticated model of producing information 
involves a server at one network site "streaming" the mul- 
timedia information over the network to a client at another 
site. The client begins to present the information as it arrives, 
rather than waiting for the entire data set to arrive before 
beginning presentation. This benefit of reduced delay is at 
the expense of increased complexity. Without the proper 
control, data overflow and underflow may occur, seriously 
degrading the quality of the presentation. 

Many modern multimedia applications involve the
transfer of a large amount of information, placing a considerable
load on the resources of the network, server, and client. The 
use of network-based multimedia applications appears to be 
growing. As computers become more powerful and more 
people access network-based multimedia applications, there 
will be an increased demand for longer, more complicated, 
more flexible multimedia applications, thereby placing even 
larger loads and demands on the network, server, and client. 
The demand placed on servers by these ever-growing mul- 
timedia applications is particularly high, as individual serv- 
ers are called upon to support larger numbers of simulta- 
neous uses: it is not uncommon even today for an Internet 
server to handle thousands of simultaneous channels. 
Consequently, there is a need in the art for a device, system, 
and method that, among other things, 

can handle longer, more complicated presentations; 

utilize a network's resources more efficiently; and 

utilize a server's and client's resources more efficiently. 

SUMMARY 

In short, the invention involves a new file format for 
organizing related multimedia information and a system and 
device for, and method of, using the new file format. The 
invention eases the management and control of multimedia 
presentations, having various media streams, each of a 
specific type, each specific type further classified by
encoding type, subtype, and encoding rate. Thus, with the
invention, an application may support several instances of a
particular media type, with each instance having different
characteristics. For example, the application may support
multiple audio streams, with each stream in a different
language, from which a user may select. Analogously, with
the invention, the application may choose a given instance
of a media stream based on the network's characteristics; for
example, the application may choose an audio subtype that
is encoded for a transmission rate that matches the network's
characteristics. The invention reduces a server's memory
and processing requirements, thus allowing a server to
simultaneously service more requests and support more
channels. And, the invention dynamically adapts the media's
streaming rate to use the network's resources more
efficiently while minimizing the effects of the adaptation on the
quality of the presentation.

The invention includes a multimedia file for organizing at
least one type of media information on a computer-readable
medium, such as a CD ROM, hard disk, or the like. The
multimedia file is capable of storing and identifying multiple
instances of the at least one media type. For example, the
media information types may include audio, video, MIDI,
etc., and the multiple instances may correspond to a French
instance of audio, an English instance, and a Chinese
instance. In this fashion, a presentation application, using
the appropriate logic, may select and present any of the
multiple instances of the media type.

One embodiment organizes the file as a file body for
storing a plurality of media blocks, each containing
information for a corresponding instance, and as a file header
having information for referencing the contents of the file
body. The header may include media instance descriptors,
each including information for describing the media block
and including information for locating the data in the body.
Some attributes of a given instance included in the
descriptor include a media type, e.g., audio, an encoding type, e.g.,
MPEG, a subtype, e.g., English, and a streaming, or
decoding, rate.

One embodiment organizes the data in the body in
packetized form, the form corresponding to a predefined network
protocol, such as the UDP interface of the TCP/IP protocol
suite. The packets may or may not be on a one-to-one
correspondence with the presentation units, eventually
processed by the presentation application. Thus a single packet
may have several audio blocks merged into it, or a single
video frame may be divided into several packets.

The packet descriptors, used to describe and locate the
packets, as well as the packets themselves may include
importance information indicating the importance of a given
packet to the perceived quality of the eventual presentation.
Some packets are critical, whereas others may be dropped
without severely degrading the quality of the presentation.

Keeping information in prepacketized form requires fewer
resources of servers and the like that use files organized
according to the format. Among other things, the format
relieves servers of having to keep huge packet windows.

To form a file according to the format, one embodiment
forms a file body, containing at least two instances of the at
least one type of media information, and forms a file header
identifying each instance so that a presentation application
may select and present any of the media types and any of the
instances thereof.

To form the body, media information is encoded into an
encoded media stream, which is then packetized, and then
formed into a media block.

The encoding operation may include encoding the media
stream for a pre-determined data rate, and the packetizing
operation may include the step of recording an assigned
importance value to each packet.

Due to its structure, the file format may be easily modified
to add or delete instances of media information.

BRIEF DESCRIPTION OF THE DRAWINGS

In the Drawing,
FIG. 1 is a block diagram showing the novel multimedia
file at a high level of abstraction;
FIG. 2A is a block diagram showing the file header of the
multimedia file;
FIG. 2B is a block diagram showing the format of a file
header preamble;
FIG. 2C is a block diagram showing the format of a media
instance descriptor;
FIG. 3 is a block diagram showing the format of a media
block;
FIG. 4A is a block diagram showing a generic media
block directory format;
FIG. 4B is a block diagram showing the format of a
generic directory preamble;
FIG. 4C is a block diagram showing the format of a
generic packet descriptor;
FIG. 4D is a block diagram showing the format of a
generic media block body;
FIG. 4E is a block diagram showing the format of a
generic packet;
FIG. 5 is a block diagram showing the format of an H.263
media block;
FIG. 6 is a block diagram showing the format of a MIDI
media block preamble;
FIG. 7 is a block diagram showing the format of a MIDI
packet;
FIG. 8 is a block diagram showing a system for creating
a multimedia file;
FIG. 9 is a block diagram showing a system embodying
the invention in a client-server context;
FIG. 10 is a block diagram showing the client and server
components of the system;
FIG. 11 is a flow diagram showing an interaction
between the client and the server;
FIG. 12 is a flow diagram showing streaming of
the server; and
FIG. 13 is a flow diagram showing the retransmit logic of
the server.

DETAILED DESCRIPTION

In short, the invention involves a new file format for
organizing related multimedia information and a system and
device for, and method of, using files organized according to
the new format. The invention eases the management and
control of multimedia presentations, having various media
streams, each of a specific type, each specific type further
classified by encoding type, subtype, and encoding rate.
Thus, with the invention, an application may support several
related audio streams, such as English and French, from
which a user may select. Analogously, with the invention,
the application may choose a particular instance of a media
stream based on the network's characteristics; for example,
the application may choose an audio stream that is encoded
for a transmission rate that matches the network's
characteristics. The invention reduces a server's memory and
processing requirements, thus allowing a server to simulta- 
neously service more requests and support more channels. 
And, the invention dynamically adapts the media's stream- 
ing rate to use the network's resources more efficiently while 
minimizing the effects of the adaptation on the quality of the 
presentation. 
I. The File Format 

The new file format, among other things, allows various 
types and subtypes of multimedia information to be 
organized, maintained, and used as a single file. The file 
format simplifies the server by not requiring the server to 
know, for example, which video files are related to which 
audio files for a given application, or to know how to locate 
and use related files, each with their own internal 
organization, corresponding method of access and process- 
ing. 

The file format allows multiple instances of a single 
media type to be stored in the file. Multiple instances of a 
single media type may be desirable for supporting alternate 
encodings of the same media type, for example, an audio 
segment in multiple languages. This flexibility allows a 
single file to contain, in effect, multiple versions of the same 
presentation. Each instance corresponds to a "presentation's 
worth" of information for that media type. For example, 
with the audio media type, an instance may involve the 
entire soundtrack in French or the entire soundtrack encoded 
at a particular rate. 

The file format allows media instances to be added to and
deleted from a file. This feature allows the file to be updated as
new media types and new media segments are developed, 
without requiring modification of the server's or client's 
logic to support the newly-added instances. This flexibility 
makes it easier to modify, create, and maintain large, com- 
plicated multimedia presentations. 

The file format allows the server to implement more 
flexible and powerful presentations. For example, the server 
could support multiple languages as various subtypes of an 
audio stream. In addition, the server could support multiple, 
expected transfer rates. For example, a video media type 
may be implemented as a subtyped instance having pre- 
packetized video data encoded for a target transfer rate of 
28.8 kb/s or encoded for a target transfer rate of 14.4 kb/s. 

Moreover, when properly used by a server or other 
application, files organized according to the new format will 
reduce the amount of memory and processor resources 
required to stream the file's contents to a client or the like. 
These advantages are further discussed below. 

The new file format 100 is shown at a high level of 
abstraction in FIG. 1. The file format 100 includes a file 
header 110 and a file body 120. In short, the file header 110 
describes the file itself and the contents of the file body 120 
and includes information used to locate data in the file body. 
The file body 120 includes more information used to locate 
data in the file body as well as including the actual data used 
during a presentation. 

More specifically, the file header 110 includes a file 
header preamble 210 and a number of media instance 
descriptors 220, shown in FIG. 2A. The file header preamble
210, shown in more detail in FIG. 2B, includes a field 211 
containing a file signature, a field 212 containing the size of 
the header, a field 213 containing the major version number, 
a field 214 containing the minor version number, and a field 
215 containing the number of media instances in the file. 
(The major and minor version numbers are used for revision
control.) The file header preamble 210 also includes reserved
fields 216 and 217 to allow for future expansion of the 
preamble. 
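The preamble fields listed above could be serialized roughly as follows. This is a minimal sketch, not the layout used by the invention: the text names fields 211-217 but does not give their widths, byte order, or the signature value, so `PREAMBLE_FORMAT`, `SIGNATURE`, and the class name `FileHeaderPreamble` are all hypothetical.

```python
import struct
from dataclasses import dataclass

# ASSUMPTION: little-endian, fixed-width fields. The source names the
# fields but does not specify widths, byte order, or signature value.
PREAMBLE_FORMAT = "<4sIHHIII"
PREAMBLE_SIZE = struct.calcsize(PREAMBLE_FORMAT)  # 24 bytes under this assumption
SIGNATURE = b"MMF0"  # placeholder signature, not taken from the source


@dataclass
class FileHeaderPreamble:
    signature: bytes      # field 211: file signature
    header_size: int      # field 212: size of the header
    major_version: int    # field 213: major version number
    minor_version: int    # field 214: minor version number
    instance_count: int   # field 215: number of media instances in the file
    reserved_1: int = 0   # field 216: reserved for future expansion
    reserved_2: int = 0   # field 217: reserved for future expansion

    def pack(self) -> bytes:
        """Serialize the preamble into its fixed-size binary form."""
        return struct.pack(
            PREAMBLE_FORMAT,
            self.signature,
            self.header_size,
            self.major_version,
            self.minor_version,
            self.instance_count,
            self.reserved_1,
            self.reserved_2,
        )

    @classmethod
    def unpack(cls, data: bytes) -> "FileHeaderPreamble":
        """Parse a preamble and reject data with an unexpected signature."""
        preamble = cls(*struct.unpack(PREAMBLE_FORMAT, data[:PREAMBLE_SIZE]))
        if preamble.signature != SIGNATURE:
            raise ValueError("not a recognized multimedia file")
        return preamble
```

A reader would typically call `FileHeaderPreamble.unpack` on the first bytes of the file, check the version numbers, and then read `instance_count` descriptors from the rest of the header.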
As shown in more detail in FIG. 2C, each media instance
descriptor 220 includes a variety of fields 221-225 which
are used to describe and identify a media instance. Some of
the fields are used to describe various characteristics or
attributes about the instance's data that will be presented,
whereas other fields are used to locate and select the data.
Field 221 indicates the offset, e.g., number of bytes, from
the beginning of the body to the corresponding media block.
The media instance descriptor 220 also includes a field 222
indicating the media type of the corresponding media block,
for example, video, audio, MIDI, and other existing and
future media types. Field 223 indicates the encoding type of
the corresponding media block, for example, H.263, H.261,
MPEG, G.723, MIDI, and other standard or proprietary
encoding types. Field 224 indicates a corresponding
subtype, for example, English audio, French audio, QCIF
video, CIF video, etc. Field 225 indicates an encoding rate
of the corresponding media block; for example, for video
information, the encoding rate indicates the target data rate
for which the video information was encoded by a video
encoder, such as the video encoder described in the related
U.S. patent application "Improved Video Encoding System
and Method", identified and incorporated above, whereas for
audio information, the encoding rate might indicate one of
a number of audio sampling rates.
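A media instance descriptor with these five fields might be represented as below. This is an illustrative sketch only: the field widths, the byte order, and the use of integer codes for media type, encoding type, and subtype are assumptions, and `MediaInstanceDescriptor` and `parse_descriptors` are hypothetical names.

```python
import struct
from dataclasses import dataclass
from typing import List

# ASSUMPTION: little-endian layout with integer codes for type fields.
# The source describes what each field means, not how it is encoded.
DESCRIPTOR_FORMAT = "<IHHHI"
DESCRIPTOR_SIZE = struct.calcsize(DESCRIPTOR_FORMAT)  # 14 bytes under this assumption


@dataclass(frozen=True)
class MediaInstanceDescriptor:
    block_offset: int   # field 221: offset from start of body to the media block
    media_type: int     # field 222: e.g. a code for video, audio, or MIDI
    encoding_type: int  # field 223: e.g. a code for H.263, G.723, or MIDI
    subtype: int        # field 224: e.g. a code for English audio or QCIF video
    encoding_rate: int  # field 225: target data rate or sampling rate

    def pack(self) -> bytes:
        """Serialize one descriptor into its fixed-size binary form."""
        return struct.pack(
            DESCRIPTOR_FORMAT,
            self.block_offset,
            self.media_type,
            self.encoding_type,
            self.subtype,
            self.encoding_rate,
        )


def parse_descriptors(data: bytes, count: int) -> List[MediaInstanceDescriptor]:
    """Parse `count` consecutive descriptors from the start of `data`."""
    if len(data) < count * DESCRIPTOR_SIZE:
        raise ValueError("header is too short for the declared instance count")
    return [
        MediaInstanceDescriptor(
            *struct.unpack_from(DESCRIPTOR_FORMAT, data, i * DESCRIPTOR_SIZE)
        )
        for i in range(count)
    ]
```

Here `count` would come from the number of media instances recorded in the file header preamble.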

As will be explained below, the number contained in field
215 is not necessarily the same as the number of media
streams that will eventually be involved in a presentation.
The file contains a number of potentially related media
streams, or instances, organized according to media type,
encoding type, subtype, and rate. A presentation, on the
other hand, will likely involve only a subset of the available
media streams, typically one instance of each of the plurality
of media types. For example, a given presentation will likely
involve only one of the multiple audio (language) subtyped
instances that may be provided by a file organized according
to the format. The same can be said for data encoded at
different rates, and of course, a user may not be interested in
a full complement of media streams, e.g., the user may not
be interested in receiving audio, even if it is supported by a
file. Depending upon the services supported by the server
(more below), the actual media types and particular
instances of those media types involved in a presentation
may be controlled by an end user, e.g., which language, and
may also be controlled by the system, e.g., which encoded
rate of audio.
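One way to picture this selection is the sketch below, which picks at most one instance per requested media type. It is a simplified illustration rather than the server's actual logic: it assumes the encoding rate can be compared against a single per-stream rate budget in bits per second, and the names `Instance` and `select_instances` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Instance:
    media_type: str  # e.g. "audio" or "video"
    encoding: str    # e.g. "G.723" or "H.263"
    subtype: str     # e.g. "French" or "QCIF"
    rate: int        # ASSUMPTION: bits per second, comparable to a rate budget


def select_instances(
    instances: Iterable[Instance],
    wanted_types: List[str],
    preferred_subtypes: Dict[str, str],
    max_rate: int,
) -> Dict[str, Instance]:
    """Choose at most one instance for each media type the user wants.

    The user's choice (e.g. language) is expressed via `preferred_subtypes`;
    the system's choice is the highest rate that fits within `max_rate`.
    """
    available = list(instances)
    chosen: Dict[str, Instance] = {}
    for media_type in wanted_types:
        candidates = [
            inst for inst in available
            if inst.media_type == media_type and inst.rate <= max_rate
        ]
        preferred = preferred_subtypes.get(media_type)
        if preferred is not None:
            candidates = [inst for inst in candidates if inst.subtype == preferred]
        if candidates:
            chosen[media_type] = max(candidates, key=lambda inst: inst.rate)
    return chosen
```

Media types the user omits from `wanted_types` are simply never selected, which mirrors a user who does not want audio even though the file provides it.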

Once the media types and particular instances of those
media types have been determined, for example, by being
selected by the user or the server, the server will construct
data structures using information from the file header 110,
described above, so that the server can index into and iterate
over the data packets contained in the file body 120.
(Indexing and iteration logic are known.) The server can also
use the header's information to perform revision control and
other known maintenance operations, discussed below when
describing an exemplary server.
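As a rough illustration of indexing into the body, the sketch below turns the per-instance offsets from the header into byte ranges. It assumes that media blocks are stored back to back with no gaps, so that each block ends where the next one begins; deriving a block's extent this way is an assumption of the sketch, and `media_block_ranges` is a hypothetical helper.

```python
from typing import List, Optional, Tuple


def media_block_ranges(offsets: List[int], body_size: int) -> List[Tuple[int, int]]:
    """Return (start, end) byte ranges within the file body, one per offset.

    ASSUMPTION: blocks are contiguous, so a block ends at the next-higher
    offset, and the last block ends at the end of the body.
    Results are returned in the same order as `offsets`.
    """
    order = sorted(range(len(offsets)), key=lambda i: offsets[i])
    ranges: List[Optional[Tuple[int, int]]] = [None] * len(offsets)
    for position, index in enumerate(order):
        start = offsets[index]
        if position + 1 < len(order):
            end = offsets[order[position + 1]]
        else:
            end = body_size
        if not 0 <= start <= end <= body_size:
            raise ValueError("offset lies outside the file body")
        ranges[index] = (start, end)
    return [r for r in ranges if r is not None]
```

With these ranges a server could read only the blocks for the selected instances and skip the rest of the body.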

The data contained in the file body 120 is organized as
contiguous media blocks 310, one media block for each
instance of a media type, as shown in FIG. 3. Each media
block 310 includes a media block directory 320 and a media
block body 330. The media block directory 320 includes
information that may be used to locate information in the
media block body 330, and the media block body 330
includes the actual data that will eventually be presented.
This data is stored in the media block body 330 in
pre-packetized form. "Pre-packetized" means that the data
stored in media block body 330 is organized as discrete
packets of information that can be transported without
requiring any processing by the server to build the packets.
An exemplary embodiment, discussed below, pre-packetizes
the data so that it can be applied directly to the User
Datagram Protocol (UDP) layer, which is part of the TCP/IP
protocol suite. (UDP and TCP/IP are known).

The pre-packetization process is media instance specific.
For example, the G.723 audio encoding standard encodes an
audio stream into a stream of blocks, in which each block
represents 30 milliseconds of audio information. One
method of pre-packetizing the audio would be to form an
audio packet for each G.723 block. A potentially more
efficient method, however, merges many G.723 blocks into a
packet, for example, 32 G.723 blocks to form an audio
packet representing 960 milliseconds' worth of audio
information.

Pre-packetizing video information, on the other hand,
may benefit from dividing, rather than merging, presentation
units. For example, under the H.263 video encoding
standard, video information is encoded as a sequence of
video frames (i.e., a frame being a presentation unit).
Although a video packet may be formed to correspond to a
single video frame, or presentation unit, advantages may be
attained by dividing the presentation unit into several
packets. In this fashion, a large video frame may be divided into
several packets so that each video packet is limited to a
predetermined, maximum packet size.

By pre-packetizing the data, an appropriately designed
server's processor load may be reduced by relieving the
processor of having to perform certain tasks such as
constructing packets on-the-fly from the media information.
With the invention, the server can simply read a packet from
the file and pass it to a UDP layer of the protocol stack via
a standard interface.

In addition, by pre-packetizing the data, an appropriately
designed server's memory requirements are reduced by relieving
the server of having to keep recently transmitted packets
available in a "packet window" in memory. Packet windows
are conventionally used to hold recently-transmitted
network packets in case they need to be retransmitted because
they were lost in the network. The protocol being used
dictates the required size of a packet window, but it is not
uncommon in modern systems to have windows that require
on the order of 100 kb of memory (RAM). Given that each
network channel requires a corresponding packet window
and that it is not uncommon for current high-demand
Internet servers to support upwards of 5,000 simultaneous
channels (with foreseeable demand growing to over 20,000
simultaneous channels in the near future), 512 Megabytes of
expensive high-speed memory are needed just to support
packet windowing. This requirement precludes many
modern personal computers, and other small systems, from
operating as a server. In contrast, the invention obviates the
need for the packet windows and thus allows smaller,
lower-cost systems to potentially operate as servers.

The organization of a generic media block 310 is shown
in FIGS. 4A-E and defines the basic template for a media
block. In short, the generic media block format describes
certain features common to all media types and instances. As
will be described below, specific media types, such as video,
audio, and MIDI may need to "supplement" the generic
template.

A generic media block directory 400, shown in FIG. 4A,
includes a directory preamble 410 and a number of packet
descriptors 420. The directory preamble 410, shown in more
detail in FIG. 4B, includes a packet count field 411,
indicating the number of packets contained in the media block
body and a presentation length field 413 indicating the total
time duration of the presentation. A number of reserved
fields 412 and 414 provide additional storage for media
instance specific information (more below).

A packet descriptor 420, shown in FIG. 4C, describes a
single packet in the media block body. There is a one-to-one
correspondence between packet descriptors 420 in the media
block directory and packets in the media block body. Each
packet descriptor 420 includes a packet offset field 421
indicating the offset, e.g., in bytes, from the start of the media
block to the corresponding packet, a time stamp field 422
indicating the start time for the packet relative to the start
time of the presentation, an importance field 423 indicating
the relative importance of the corresponding packet (more
below), and a packet length field 425 indicating the length,
e.g., in bytes, of the corresponding packet.

A generic media block body 430, shown in FIG. 4D,
includes a number of packets 440. This number represents
the number of packets in the stream, or instance.

A packet 440, shown in more detail in FIG. 4E, includes
a channel number field 441, which is reserved for a currently
undefined future use, and an importance field 442 indicating
the relative importance of the packet to the perceived quality
of the presentation (more below). Each packet 440 also
includes fields 443-444 that may be used for segmentation
and reassembly (SAR) of presentation units. As outlined
above, depending on the media instance and other
circumstances, advantages may be attained by dividing
video frames into multiple packets. To accomplish this, field
443 may be used to indicate the sequence number of the
given packet relative to the multiple packets composing the
presentation unit, e.g., video frame. A total segments field
444 indicates the total number of segments.

The packet 440 further includes a packet length field 445
indicating the number of bytes of data contained in the
packet, a sequence number field 447 which is reserved for a
currently undefined future use, a time stamp field 448
indicating the start time for the packet relative to the start of
the presentation, and a data field 460 containing the media
data that will be eventually presented. This media data is
encoded in a predetermined format, according to the media
type, encoding type, subtype, and rate (see FIG. 2C); the
encoding format may be standardized, in the process of
being standardized, or proprietary. A reserved field 449
provides additional storage for media type specific
information.

Each packet, like its descriptor, includes pre-assigned
importance information, indicative of the relative
importance of the packet with respect to the quality of the eventual
presentation. As will be explained below, some media
frames are highly important; their absence from the
presentation may make it unintelligible. Other frames are less
important; their absence may be barely noticed or may be
"concealable." ("Concealing" is known.)

As will be described below when discussing an exemplary
server, importance information may be used by both the
server and the client. For example, the server may use the
information to intelligently adapt its streaming
characteristics to better utilize the network's resources. The server may
"drop" packets from being sent, when needed, or send
multiple copies of packets, if beneficial. On the other hand,
the client may use the information to intelligently maintain
synchronism. The client may drop relatively unimportant
video packets to maintain synchronism with the audio
stream being presented.

Regarding the format of specific media blocks, video
information may be formed into a media block, having data
in accordance with the H.263 format. H.263 specifies that
video streams will include "I" frames, "P" frames, and "PB"
frames. The "I" frames are independent frames in that they
represent a complete image, with no dependencies on the
prior history of the video image. Therefore, I frames can be
considered as specifying the "state" of a picture. The "P" and
"PB" frames, on the other hand, specify the changes to a
current picture, as defined by an I frame and intervening P
and PB frames. Therefore, P and PB frames are akin to "state
changes," rather than states. A typical video stream would
involve an I frame followed by P and PB frames, followed
by more I, P, and PB frames. The processing of I, P, and PB
frames is known.

The format of the H.263 media block is generally the
same as the generic media block, described above with
reference to FIGS. 3-4E. However, the H.263 media block
does supplement the generic format in certain areas. The
H.263 directory preamble 500, shown in FIG. 5, for
example, differs from the generic template by including a
field 512 indicating the maximum frame size in the
presentation and by including a field 514 indicating the number of
I-frames in the H.263 video stream, or instance.

Audio information may be formed into a media block,
having data in accordance with the G.723 format. The
format of the G.723 media block is identical to the generic
media block, and contains no additional information.

MIDI information may be formed into a media block,
having data in accordance with a MIDI format. The format
of a MIDI media block is generally the same as the generic
media block. However, the MIDI media block does
supplement the generic format in certain areas. The MIDI directory
preamble 600, shown in FIG. 6, for example, differs from the
generic template by including a field 612 indicating one of
a plurality of MIDI formats, a field 614 indicating the
number of tracks contained in the MIDI stream, and a
divisor field 616 used to specify the tempo of the MIDI
stream. The MIDI packet 700, shown in FIG. 7, differs from

The encoded video, originating from encoder 825, and the
encoded audio, originating from encoder 830, are applied to
respective packetizers 835 and 840. Each packetizer
arranges the encoded data in pre-packetized form with the
necessary packet information, e.g., importance, packet
length, and time stamp (see FIG. 4E). For the most part, each
video frame will have its own corresponding packet, such as
a UDP packet, but as described above with regard to FIG.
4E, a frame may be divided up into several packets, for
example, if the video packet is very large. Analogous packet
processing is applied to the audio frames.

The output of the packetizer is written to file 850, for
example, on a CD ROM, such that each media type uses a
contiguous portion of a body 120 to contain the media block.
The header 110, including the preamble 210 and media
instance descriptors 220 (see FIG. 2), is programmed with
the values, described above, as part of the final assembly of
the file 850. Much of the information, such as the importance
information, is supplied by the encoders 825 and 830.

The multimedia file 850 is easily modified to add or delete
media blocks, for example, to add additional languages, data
rates, or background music. To add a media block, the file
header is edited to include a new media instance descriptor,
and the media block is added into the file body. To delete a
media block, the file header is edited to delete the
corresponding media instance descriptor, and the media block is
removed from the file body.

III. Overview of Device and System for, and Method of,
Presenting Multimedia Information

In addition to the above, the invention provides a device
and system for, and method of, presenting the multimedia
information in a manner that efficiently uses network
bandwidth and that efficiently uses a server's processor and
memory resources. As will be explained below, the device,
system, and method allow a server to support more
simultaneous channels and allow a server to intelligently and
dynamically adapt its streaming characteristics,
the generic packet 440 (see FIG. 4E) by including a field 749 More specifically, by using the pre-packetized 
indicating the start time of the MIDI packet relative to the information, the server needs less processor and memory 
start of the presentation and by including a field 750 indi- 40 resources to support a channel vis-a-vis conventional 
eating the relative end time of the MIDI packet. arrangements, and thus, a server having a given amount of 
II. Creating Files According to the Novel Format processor and memory resources can support more channels 

It is expected that developers will create multimedia files by using the invention. In addition, by using the importance 

in the format described above using various methods and information prc-assigncd to each packet, a server can 

systems. FIG. 8 shows but one simple example. For 45 dynamically adapt its streaming characteristic to send mul- 

simplicity, the exemplary system 800 creates a file 850 with tiple copies of a packet if the situation warrants, or to 

an audio stream(s) and video stream(s) only, but skilled eliminate certain packets from being sent if the situation 

artisans will appreciate the relevance of the teachings to warrants. 

other media types, e.g., MIDI, etc. A high-level architectural diagram of a system, embody- 

Media components, such as VCR 805, camera 810, and 50 ing the invention in a client-server context, is shown in FIG. 

microphone 815, generate media information, which is 9. In this context, a client 910 communicates with a server 

received by a sound and video capture system 820, for 920 via a communication network 930. The client may, for 

example, available from AVID Technologies or Adobe, Inc. example, include a PC, and the server may include a 

The capture system 820 may store the captured audio and UNIX-based workstation, minicomputer, or the like. The 

video information in a proprietary, but known, format. 55 communication network 930 may utilize a variety of known 

The captured information is applied to video encoder 825 physical mediums and protocols, such as modem links, 

and audio encoder 830, which encode the captured infor- LANs, using the TCP/IP suite or the like. The communica- 

mation into frames according to predetermined formats, lions network 930 allows control information 931 and data 

such as H.263 for video and G.723 for audio. For example, 932 to be exchanged between the client 910 and the server 

the video encoder 825 may use software -based logic to do 60 920. 

the encoding from the format of the capture system 820 into IV. A More Specific Example of the Device, System, and 

the predetermined format, e.g., H.263. An exemplary Method 

encoder is described in the U.S. patent application entitled One specific example of a client-server context in which 

"Improved Video Encoding System and Method", identified the invention may be practiced is the Internet. In the Internet 

and incorporated above. Analogous software-based process- 65 context, as will be appreciated by users familiar with the 

ing could be applied to the audio information, originating Internet, the client 910 might include a known browser 

from system 820. application for navigating access to various severs 920 
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connected to the Internet. Each server might use known web 
server applications to display web pages, which might 
include text and graphical icons, among other things. Some 
of the icons might be "active" in which case clicking the 
icon would cause a video clip to be presented to the client. 
The mechanisms for causing the presentation of the video 
clip are discussed below. 

An exemplary arrangement of using the invention in the 
Internet context is shown in FIG. 10. The client 910 includes 
conventional browser application 1010, and the server 920 
includes conventional web server application 1030 and web 
pages 1031, all of which will not be discussed in detail 
because they are known. The web browser application 1010 
cooperates with novel multimedia client application 1020 to 
initiate the processing of file 850, described above. In turn, 
multimedia client application 1020 cooperates with novel 
multimedia server application 1040 to produce the multi- 
media file 850 from the server. The interaction and coop- 
eration of the above entities are further described below. 

a. Set Up 

The web browser application and the web server appli- 
cation interact through a known standard interface, which 
will be described briefly. The web browser application is 
used to view web pages on various servers in the network. 
A web page may allow access to certain files stored on the 
server, including multimedia files such as file 850 (see FIG. 
8). To view a file, the web browser requests a Universal 
Resource Locator (URL) from the web server application, 
and the web server application responds with a message 
which includes a MIME type which specifies the location 
and type of the file. Based on the file type, the web browser 
application may need to invoke a "helper" application, 
which is used specifically to handle files having the specified 
file type. In the case of multimedia file 850, the web browser 
application invokes the novel multimedia client application, 
which initiates an interaction with the novel multimedia 
server application to produce the multimedia file. 

An exemplary initial interaction between the multimedia 
client application 1020 and the multimedia server applica- 
tion 1040 is described below with reference to FIG. 11. The 
interaction begins in step 1100 and proceeds to step 1110, 
where the multimedia client application 1020 sends a media 
request message to the multimedia server application 1040 
specifying the desired media types to be produced, and 
specifying a version number of the multimedia client appli- 
cation and a port number to which the multimedia server 
application should direct communication. In step 1120, the 
multimedia client application 1020 sends a message to the 
multimedia server application 1040 specifying a desired rate 
of transmission. The desired rate of transmission may be 
determined by the client, for example, by determining the 
communication rate of an attached communication device 
such as a modem (more below). The interaction then pro- 
ceeds to step 1130, where the multimedia client application 
1020 sends a "go" message to the server application 1040, 
informing the multimedia server application that the initial 
client messages have been sent, and therefore allowing the 
server to determine whether or not all messages were 
received. 

Upon receipt of the "go" message, the interaction proceeds
to step 1140, where the multimedia server application 1040
sends a media response message to the multimedia client
application 1020 specifying, among other things, the time of
day as determined by a reference clock in the server or
network. In step 1150, the multimedia server application
1040 sends a message to the multimedia client application,
which immediately responds in step 1160 with another
message. The server application 1040 uses the above "echo"
to calculate a round-trip delay of the network. In step 1170,
the multimedia server application 1040 sends a configuration
message to the multimedia client application 1020
specifying, among other things, the sizes and relationships
of the information to be produced. The multimedia client
application 1020 uses the configuration information to
control its consumption of the information produced, or
streamed, by the server. Finally, in step 1180, the multimedia
server application 1040 sends a "go" message to the client
application 1020, indicating the end of the initial interaction.
The interaction ends in step 1199, and the server is ready to
begin streaming the media data from the file.
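
The set-up exchange of FIG. 11 can be summarized as an ordered message sequence. The following Python sketch is illustrative only: the message labels, the tuple layout, and the helper names `round_trip_delay` and `client_messages_complete` are assumptions made for this example and are not defined by the applications described above.

```python
# Ordered (step, sender, label) tuples mirroring steps 1110-1180 of FIG. 11.
# The labels are hypothetical names, not wire-format identifiers.
SETUP_SEQUENCE = [
    (1110, "client", "media_request"),   # media types, client version, port
    (1120, "client", "desired_rate"),    # desired rate of transmission
    (1130, "client", "go"),              # end of the initial client messages
    (1140, "server", "media_response"),  # includes reference time of day
    (1150, "server", "echo_request"),    # starts the round-trip measurement
    (1160, "client", "echo_reply"),      # immediate reply from the client
    (1170, "server", "configuration"),   # sizes and relationships of the data
    (1180, "server", "go"),              # end of the initial interaction
]


def round_trip_delay(sent_at, received_at):
    """Return the round-trip delay (seconds) from the two echo time stamps."""
    if received_at < sent_at:
        raise ValueError("reply cannot precede request")
    return received_at - sent_at


def client_messages_complete(received):
    """Check, on receipt of "go", that every initial client message arrived."""
    expected = [label for step, sender, label in SETUP_SEQUENCE
                if sender == "client" and step <= 1130]
    return all(label in received for label in expected)
```

In this sketch the server would call `client_messages_complete` when the step 1130 "go" message arrives, and `round_trip_delay` once the step 1160 reply is received.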

b. Streaming

In an exemplary embodiment, the streaming process is
performed using the UDP transport protocol of the TCP/IP
protocol suite. The UDP protocol is an "unreliable protocol"
in that it assigns responsibility of reliable transmission to
higher layers. Thus, the client logic and the server logic are
responsible for detecting lost packets and the like and
performing the appropriate action. This situation is unlike
other transport protocols such as TCP in which detection of
lost packets and responding to such detection is handled by
the transport and lower layers.

The streaming logic of the server is shown at a high level
of abstraction in FIG. 12. The logic starts at step 1200 and
proceeds to step 1210 where it is determined whether
streaming should continue. This step includes determining
whether all of the packets for the requested presentation
have already been streamed, whether an error has occurred,
whether insufficient bandwidth exists to make continued
presentation impractical, whether the user has requested the
presentation to stop, whether the user has disconnected from
the server, and the like. If streaming is to stop, the logic
proceeds to step 1299, which ends the flow. If streaming is
to continue, the logic streams a time-slice's worth of packets
in step 1220. The actual logic of this step is described below.

After streaming a time-slice's worth of the packets in step
1220, the logic proceeds to step 1230 where the logic
requests to go to sleep, or go idle, for a time quantum
corresponding to the streaming rate. For example, if the
logic is streaming audio to correspond to 960 ms slices of
presentation, step 1220 will stream 960 ms worth of
information, and step 1230 will request to go to sleep until
it is time to stream another 960 ms worth of information.
Break 1235 indicates that a break in the flow of control from
step 1230 back to step 1210 may occur (more below).
Assuming no break in the flow, the above logic will continue
to loop and stream media packets, until one of the
conditions, discussed above with regard to step 1210, is met.
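
The loop of steps 1210-1230 can be sketched as follows. This is a minimal illustration under stated assumptions, not the server's actual implementation: the function name `stream`, its parameters, and the representation of a presentation as a list of per-slice packet lists are all hypothetical, and the break at point 1235 for retransmission is omitted.

```python
import time


def stream(slices, send, should_continue, quantum_s=0.96, sleep=time.sleep):
    """Stream one time-slice's worth of packets per quantum.

    slices          -- iterable of packet lists, one list per time slice
    send            -- callable that transmits a single packet
    should_continue -- callable standing in for the checks of step 1210
    quantum_s       -- sleep interval in seconds (960 ms in the example)
    sleep           -- injectable so the loop can be exercised without waiting
    """
    sent_slices = 0
    for packets in slices:
        if not should_continue():      # step 1210: stop condition met
            break
        for packet in packets:         # step 1220: stream the time slice
            send(packet)
        sent_slices += 1
        sleep(quantum_s)               # step 1230: go idle for one quantum
    return sent_slices                 # step 1299: end of flow
```

Passing a stub for `sleep` lets the control flow be checked without real delays; a production loop would also subtract the time spent sending from the sleep interval.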

As alluded to above, modern networks lose packets for
various reasons. If the client detects a lost packet it may send
to the server a retransmission request for the missing packet.
Given that the exemplary embodiment uses UDP, this
retransmission request is handled by the server logic 1040,
rather than the lower layers of the protocol.

In response to a retransmission request, the logic 1040
will break at point 1235 and proceed to step 1240 which will
retransmit the packet. This logic, like the logic of step 1220,
is discussed below. The logic then proceeds to step 1250
which determines whether there is more time during which
the logic should remain asleep. If so, the logic proceeds to
step 1260, where the logic goes back to sleep for the
remaining quantum set in step 1230. If not, the logic proceeds
directly to step 1210 to continue streaming.

The streaming and retransmit logic of steps 1220 and
1240 are described below. A few general words are
warranted before the detailed discussion. First, the streaming
and retransmit steps may determine not to transmit certain
packets at all. Second, the streaming and retransmit logic
dynamically adapt the streaming characteristics of the server
logic by either eliminating certain packets from being
transmitted, or by sending multiple copies of other packets
to increase the probability that the client will receive a
complete sequence of packets. This dynamic adaptation is
based on information contained in the packets and their
descriptors and on network statistics gathered before and
during the streaming operation. More specifically, the
statistics are used to create "inferences" of the network state,
which are used by the logic, described below, to adapt its
behavior.

The server logic, as a general matter, indexes into and
iterates over the various streams involved in the
presentation. (Indexing and iteration are known.) When
retransmission is involved, the server logic indexes into the
file to select the appropriate packets, while maintaining the
iteration pointers for the normal streaming loop, i.e., steps
1210-1235 (see FIG. 12).

The presentation is divided into time quanta, the length of
which is typically dictated by the audio presentation rate and
quanta, e.g., 960 ms. The server logic selects the appropriate
packets, using indexing and iteration, from the file
containing the various streams of the presentation. An
exemplary embodiment selects the packets to be transmitted
by comparing the time stamps 422 (see FIG. 4C)
corresponding to the packetized data with a clock value that
is maintained by the server logic in rough synchronism with
the client. Alternative arrangements include selecting the
related packets based on a ratio relating the various streams;
for example, several audio packets may correspond to a
single video packet. Similarly, if other file organizations are
used, for example, pre-interleaved audio and video, the
server logic would select the appropriate segments.
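
The time-stamp comparison described above can be sketched as a cursor that advances through a stream's packet descriptors. The sketch is illustrative only: `PacketDescriptor` mirrors the descriptor fields named in the text (offset, time stamp, importance, length), but the Python field names, the function `select_for_slice`, and the assumption that descriptors are stored in time-stamp order are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PacketDescriptor:
    offset: int          # offset of the packet within the media block
    time_stamp_ms: int   # start time relative to the start of presentation
    importance: int      # 1 is the highest importance
    length: int          # size of the packet in bytes


def select_for_slice(descriptors, cursor, clock_ms, quantum_ms=960):
    """Return (selected, new_cursor) for one time slice.

    Selects the descriptors, starting at `cursor`, whose time stamps fall
    before the end of the slice that begins at `clock_ms`. Assumes the
    descriptors are sorted by time stamp.
    """
    limit = clock_ms + quantum_ms
    end = cursor
    while end < len(descriptors) and descriptors[end].time_stamp_ms < limit:
        end += 1
    return descriptors[cursor:end], end
```

Keeping one cursor per stream corresponds to the iteration pointers of the normal streaming loop; a retransmission would index directly into the descriptor list without moving the cursor.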

Once the next packets to potentially send are identified,
the logic determines whether they should be sent and, if so,
how many copies should be sent. As outlined above, this is
determined based on statistic-based inferences (more below)
and upon pre-assigned importance fields of the packets. An
exemplary embodiment uses the number 1 for highest
importance packets, with higher numbers corresponding to
lower importance. As discussed above, these importance
fields are assigned during the creation of the file, for
example, with the encoder outlined above. Some encodings
that are expected to be followed by all implementations, and
which are followed by the exemplary embodiment, are that
audio packets and I frames of video will be assigned an
importance of 1, i.e., highest priority. In this fashion, as will
be evident from the description below of the detailed logic
of FIG. 13, these packets will always be transmitted and
retransmitted when requested.

The logic of FIG. 13, described below, utilizes statistics
gathered on the network. An exemplary embodiment
implements the statistics gathering function in the client 910,
because it will be the only entity capable of gathering certain
statistics. If a different set of statistics is utilized in other
embodiments, it may be beneficial to place the responsibility
of gathering statistics with the server. Moreover, either the
client 910 or the server 920 could perform the analysis on
the statistics to create inferences or the like. In the former
case, the client would gather and analyze the statistics and
send the inferences to the server as control packets for use
by the logic of FIG. 13. In the latter case, the client would
send the statistics to the server, which would then analyze
the statistics to create the inferences used by the logic of
FIG. 13.

The currently envisioned set of statistics necessary to
create the inferences used by the logic of FIG. 13 includes the
following:

1. bit rate throughput;

2. network jitter;

3. round-trip delay; and

4. percentage and distribution of packet loss.

Statistics 2-4 may be obtained with conventional
techniques. Consequently, the gathering of these statistics is
not discussed.

Gathering statistic 1, however, requires novel
functionality, because not all communication devices, e.g.,
certain modem models, are capable of providing this statistic
during the streaming operation. An exemplary client
determines throughput by monitoring and analyzing UART
interrupts to determine, or infer, the modem's throughput
measured as a bit rate.

As outlined above, the transmission and retransmission
steps of FIG. 12 are highly abstracted. The detailed logic of
these steps is described with reference to FIG. 13.

The logic begins at step 1300 and proceeds to step 1310,
where an inference is checked to see if there is a modem or
similarly related network performance problem. This
inference is created from a detected decrease in throughput
(e.g., bit rate) without a corresponding detection of significant
packet loss. If there is such a problem, the logic proceeds to
step 1320 where a "throttle down" operation is performed,
which sets a flag indicating that a packet with importance
fields equal to 5 should not be sent, or more specifically that
a packet whose descriptor has an importance field of 5
should not be sent. As stated above, the assignment of fields
depends on the encoding of the file, but some examples of
expected packets having assignments equal to 5 include the
last P or PB frames immediately before an I frame of video
in a video stream. These frames are relatively less important
because the following I frame will reset the state and there
are no intervening states depending on these P frames.

If there is no modem or analogous problem in step 1310,
the logic proceeds to step 1330 to determine if there is a
network-based problem. This inference is created from
detecting a significant packet loss. If there is no
network-based problem, the logic proceeds to step 1340, which
indicates that no throttling should occur, in short, that the
network is performing adequately and that the streaming
characteristics should be normal and not modified.

If there is a network problem in step 1330, the logic
proceeds to step 1350 to determine whether or not the
network problem involved random packet loss. If not,
meaning that the packet loss involved a significant number of
contiguous or nearly-contiguous packets in a small time
frame, the inference is that the network is experiencing
congestion, and the logic proceeds to step 1360, where a
throttle down operation is performed to set a flag that
packets with importance 4 or higher should not be sent.

If step 1350 indicates that the problem is random packet
loss, the logic proceeds to step 1370, where a "throttle up"
operation is performed if the statistics indicate that the
network has available bandwidth. The available bandwidth
is determined as part of the initial characterization of the
network, discussed above with regard to FIG. 11B and
which may be continually updated as part of the concurrent
network characterization. More particularly, flags will be set
to indicate that packets having importance less than 3 should
be sent multiple times, e.g., in duplicate.

Steps 1320, 1340, 1360, and 1370 then proceed to step
1380, which sends the time quantum's worth of packets,
discussed above, in consideration of the various flags set
during throttle up and down steps. The logic then ends in
step 1399.
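
The decision flow of FIG. 13 can be sketched as two small functions: one that turns the inferences into throttle flags, and one that applies the flags to a packet's importance field. This is an illustration under stated assumptions rather than the described logic itself. The function and parameter names are hypothetical, the inferences are reduced to booleans, "importance 4 or higher" is read numerically, and 5 is assumed to be the largest importance value in use, so that "5 or above" coincides with "equal to 5" in step 1320.

```python
def infer_throttle(throughput_dropped, significant_loss,
                   loss_is_random, bandwidth_available):
    """Return (drop_at_or_above, duplicate_below) per steps 1310-1370.

    drop_at_or_above -- packets whose importance value is >= this are
                        not sent (None means nothing is dropped)
    duplicate_below  -- packets whose importance value is < this are
                        sent in duplicate (None means no duplication)
    """
    if throughput_dropped and not significant_loss:
        return 5, None           # step 1320: modem-type problem
    if not significant_loss:
        return None, None        # step 1340: no throttling
    if not loss_is_random:
        return 4, None           # step 1360: congestion inferred
    if bandwidth_available:
        return None, 3           # step 1370: throttle up
    return None, None            # random loss but no spare bandwidth


def copies_to_send(importance, drop_at_or_above, duplicate_below):
    """Return how many copies of a packet step 1380 would send."""
    if drop_at_or_above is not None and importance >= drop_at_or_above:
        return 0
    if duplicate_below is not None and importance < duplicate_below:
        return 2
    return 1
```

Under these flags a packet of importance 1, such as an audio packet or an I frame, is never dropped, which matches the statement that such packets are always transmitted.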

The throttle down operations eliminate certain packets
from being transmitted or retransmitted. In short, the
underlying philosophy is that, in these situations, sending the
packet will be to no avail or will exacerbate the detected and
inferred problem. For example, if there is low throughput
because of a modem problem, it does not make sense to send
all of the packets because the client will likely not receive
them in sufficient time for synchronized presentation.
Moreover, if the network is congested or the like, resending
packets will only increase the congestion and worsen
performance.

The throttle up operations transmit, or retransmit, certain
packets if bandwidth is available. In short, the underlying
philosophy is that, if the network resources are available,
then they should be utilized to increase the likelihood that
the presentation will include more of the media information
(i.e., less likely to drop packets for synchronization
operations at the client's end).

In both cases, the throttling is performed based on the
importance information in the packet, which as outlined
above indicates the importance of the packet to the quality
of the presentation. Thus, less important packets are
eliminated if the situation dictates, and more important packets
are repeated if the situation dictates. The underlying effect is
that the streaming characteristics are dynamically adapted in
consideration of the importance of the information to the
quality of the presentation.

Skilled artisans will appreciate that the assignment of
importance fields may change and that other inferences than
those discussed above are easily incorporated into the above
logic. Moreover, the logic may also be adapted to
dynamically change a streaming subtype if the performance
situation warrants.

Skilled artisans will also appreciate that the above
architecture may easily incorporate special effects processing,
the logic of which will be evident, given the above description.
For example, the server may provide fast forward and
reverse functions. In such cases, the streaming logic, for
example, could consider I frames only and ignore the P and
PB frames.
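
As a rough illustration of the fast forward and reverse example, the sketch below keeps only I-frame packets. It assumes each packet record carries a `frame_type` key, which is a hypothetical field introduced for this example; the text above does not state that frame type is stored in the packet, and an implementation might instead derive it from the encoding or from the importance field.

```python
def trick_play_packets(packets, reverse=False):
    """Return only the I-frame packets, optionally in reverse order.

    `packets` is a list of dicts with a hypothetical "frame_type" key
    whose value is "I", "P", or "PB".
    """
    i_frames = [p for p in packets if p["frame_type"] == "I"]
    return i_frames[::-1] if reverse else i_frames
```

Because every I frame is a complete image, the selected packets can be presented in either direction without the intervening P and PB frames.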

Skilled artisans will also appreciate that, although the file
was particularly discussed in relation to UDP-ready packets,
the invention is applicable to other protocols, other layers
than transport, and other segmentations of information, e.g.,
fixed size cells.

Skilled artisans will also appreciate that, although the
media blocks were described as a contiguous arrangement,
the invention is applicable to a single media block
containing multiple media types arranged in a pre-interleaved
format.

Skilled artisans will also appreciate that, although the file
format was discussed in which media units, such as large I
frames, are sub-divided, the invention is also applicable to
merging smaller frames into larger packets.

Skilled artisans will also appreciate that, although the
characterization of the network was discussed as a general
characterization, the invention is also applicable to
characterizing the network on a channel-by-channel basis and
by testing the network with a test audio stream and video
stream, for example.

The present invention may be embodied in other specific
forms without departing from its spirit or essential
characteristics. The described embodiments are to be considered
in all respects only as illustrative and not restrictive. The scope
of the invention is, therefore, indicated by the appended
claims rather than by the foregoing description. All changes
which come within the meaning and range of equivalency of
the claims are to be embraced within their scope.

What is claimed is:

1. A multimedia file embodied in a computer-readable 
medium for storing a computer-readable multimedia presen- 
tation containing a number of media types, the multimedia 
file comprising: 

a number of media blocks, each media block comprising 
an instance of the multimedia presentation containing 
encoded information for one of the number of media 
types, the number of media blocks including a plurality 
of media blocks having the same media type and 
different encodings; and 

a media instance descriptor for each of the number of 
media blocks, each media instance descriptor indicat- 
ing the media type and encoding for the corresponding 
media block. 

2. The multimedia file of claim 1 comprising:

a file body containing the number of media blocks; and

a file header containing a file header preamble having
information for identifying the multimedia file and
further containing the media instance descriptor for
each of the number of media blocks in the file body,
where each media instance descriptor includes:

a media block offset field indicating a starting offset of
the corresponding media block within the file body;

a media type field indicating the media type of the
corresponding media block;

an encoding type field indicating an encoding format of
the corresponding media block;

a subtype field indicating a media subtype corresponding
to the media type and the encoding format of the
corresponding media block; and

a rate field indicating an encoding rate of the
corresponding media block.

3. The multimedia file of claim 2 wherein the file header 
preamble comprises: 

A) a header size field for storing the size of the file header; 
and 

B) a descriptor count field for storing the number of media 
instance descriptors in the file header. 

4. The multimedia file of claim 2 wherein each media 
block comprises: 

A) a media block body field containing data used for 
presentation by the presentation application; and 

B) a media block directory field having information for 
locating data contained in the media block body field. 

5. The multimedia file of claim 4 wherein the media block 
body field comprises a plurality of packets, each packet 
organized according to a predefined network protocol and 
each containing one of: 

A) an unsegmented presentation unit; and 

B) one segment of a segmented presentation unit. 

6. The multimedia file of claim 5 wherein the media block 
directory field of each media block comprises: 

A) a directory preamble having information for describing 
the contents of the media block; and 

B) a plurality of packet descriptors, each packet descriptor 
having information for identifying one packet in the 
media block body. 

7. The multimedia file of claim 6 wherein the directory
preamble comprises:

A) a packet count field for indicating the number of
packets contained in the media block body; and

B) a presentation length field for indicating the duration of
the presentation.

8. The multimedia file of claim 7 wherein one of the 
media types is video and wherein the directory preamble for 
a video media instance comprises: 

A) a maximum frame size field for indicating the size of 
the largest video frame in the video instance; and 

B) an I-frame count field for indicating a number of 
I-frames in the video instance. 

9. The multimedia file of claim 8 wherein one of the 
media types is MIDI and wherein the packet for a MIDI 
media instance comprises: 

A) a start time field for indicating a start time for the MIDI 
instance relative to the start of the presentation; and 

B) an end time field for indicating an end time for the 
MIDI instance relative to the start of the presentation. 

10. The multimedia file of claim 7 wherein one of the 
media types is audio. 

11. The multimedia file of claim 7 wherein one of the
media types is MIDI and wherein the directory preamble for 
a MIDI media instance comprises: 

A) a MIDI format field for indicating one of a plurality of 
MIDI formats; 

B) a track count field for indicating a number of tracks 
contained in the MIDI instance; and 

C) a divisor field for specifying a tempo for the MIDI
instance. 

12. The multimedia file of claim 6 wherein each packet 
descriptor comprises: 

A) a packet offset field for indicating an offset from the 
start of the media block to a corresponding packet; 

B) a packet descriptor time stamp field for indicating a 
start time for the corresponding packet, relative to the 
start of the presentation; 

C) a packet descriptor importance field for indicating a 
relative importance of the corresponding packet to the 
presentation; and 

D) a packet length field for indicating the size of the 
corresponding packet. 

13. The multimedia file of claim 5 wherein each packet 
comprises: 

A) a packet importance field for indicating a relative 
importance of the packet to the presentation; 

B) a length field for indicating the length of the packet;

C) a time stamp field for indicating a start time for the
packet, relative to the start of the presentation; and

D) a data field for storing media information.

14. The multimedia file of claim 13 wherein each packet 
further comprises: 

A) a total segments field for indicating a number of
sequentially-numbered packets into which a presentation
unit of media information is segmented; and

B) a segment number field for indicating a relative 
sequence number among the number of sequentially- 
numbered packets. 

15. The multimedia file of claim 2 wherein one of the 
media types is video and wherein the file body includes a 
plurality of video media blocks. 

16. The multimedia file of claim 15 wherein each video 
media block is encoded for a different network rate. 

17. The multimedia file of claim 16 wherein the rate field
of each video media block indicates the corresponding
encoding rate for which it was encoded.

18. The multimedia file of claim 2 wherein one of the
media types is audio and wherein the file body includes a
plurality of audio media blocks.

19. The multimedia file of claim 18 wherein each audio
media block is encoded for a different language.

20. The multimedia file of claim 19 wherein the subtype 
field of each audio media block indicates the corresponding 
language for which it was encoded. 

21. The multimedia file of claim 2 wherein the media
types include video, audio, and MIDI, and wherein the file
body includes:

A) at least one video media block; 

B) at least one audio media block; and 

C) at least one MIDI media block. 

22. A method of forming a multimedia file on a
computer-readable medium from a multimedia presentation
containing a number of media types, the method comprising
the steps of:

forming a number of encoded media streams from at least
one of the number of media types, the number of
encoded media streams including a plurality of encoded
media streams having the same media type and
different encodings;

packetizing each of the number of encoded media streams
to form a number of packetized media streams;

forming a media block for each of the number of
packetized media streams;

forming a media instance descriptor for each of the media
blocks, each media instance descriptor indicating the
media type and encoding for the corresponding media
block; and

forming on the computer-readable medium the
multimedia file including the media block for each of the
number of packetized media streams and the media
instance descriptor for each of the media blocks.

23. The method of claim 22 wherein each media instance 
descriptor includes: 

a media block offset field indicating a starting offset of the
corresponding media block within the file body;

a media type field indicating the media type of the
corresponding media block;

an encoding type field indicating an encoding format of
the corresponding media block;

a subtype field indicating a media subtype corresponding
to the media type and the encoding format of the
corresponding media block; and

a rate field indicating an encoding rate of the corresponding
media block;

and wherein the step of forming the multimedia file comprises the steps of:

forming on the computer-readable medium a file body
containing the media blocks; and

forming on the computer-readable medium a file header
containing a file header preamble having information
for identifying the multimedia file and further containing
the media instance descriptor for each of the
media blocks.
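
The file layout recited in claim 23 (a file header holding a preamble and one media instance descriptor per media block, followed by a file body) can be sketched as follows. The byte layout is purely an assumption for illustration: the `MMF0` identifier, the field widths, the little-endian encoding, and the function names are hypothetical and are not taken from the specification.

```python
import struct
from typing import List, NamedTuple, Tuple

# Hypothetical binary layout (little-endian, no padding).
PREAMBLE = struct.Struct("<4sH")      # file identifier, descriptor count
DESCRIPTOR = struct.Struct("<IBBHI")  # offset, media type, encoding type, subtype, rate
MAGIC = b"MMF0"                       # assumed identifier, not from the patent


class Descriptor(NamedTuple):
    offset: int          # starting offset of the media block within the file body
    media_type: int
    encoding_type: int
    subtype: int
    rate: int


def build_file(blocks: List[Tuple[int, int, int, int, bytes]]) -> bytes:
    """Form header (preamble + descriptors) and body from packetized blocks."""
    descriptors = []
    body = bytearray()
    for media_type, encoding_type, subtype, rate, payload in blocks:
        descriptors.append(
            Descriptor(len(body), media_type, encoding_type, subtype, rate))
        body += payload
    header = PREAMBLE.pack(MAGIC, len(descriptors))
    header += b"".join(DESCRIPTOR.pack(*d) for d in descriptors)
    return header + bytes(body)


def read_descriptors(data: bytes) -> Tuple[List[Descriptor], int]:
    """Parse the header; return the descriptors and the body's start position."""
    magic, count = PREAMBLE.unpack_from(data, 0)
    if magic != MAGIC:
        raise ValueError("not a recognized multimedia file")
    descriptors = [
        Descriptor(*DESCRIPTOR.unpack_from(data, PREAMBLE.size + i * DESCRIPTOR.size))
        for i in range(count)
    ]
    return descriptors, PREAMBLE.size + count * DESCRIPTOR.size
```

Because each offset is measured from the start of the file body, a reader can locate any media block by adding its offset to the body's start position without scanning the other blocks.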

24. The method of claim 23 wherein the step of encoding
the media information comprises the step of encoding the
media stream for a predetermined data rate.

25. The method of claim 23 wherein the step of packetiz- 
ing the encoded media stream to form a packetized media 
stream comprises assigning a time stamp to each packet. 

26. The method of claim 25 wherein the step of packetiz-
ing the encoded media stream further comprises the step of 
recording an assigned importance value to each packet. 

27. The method of claim 23 further comprising the step of
adding an additional media stream to an existing multimedia 
file. 

28. The method of claim 27 comprising the steps of: 

A) encoding the additional media stream to form an 
additional encoded media stream; 

B) packetizing the additional encoded media stream to 
form an additional packetized media stream;

C) forming an additional media block from the additional
packetized media stream; 

D) adding the additional media block to the file body of 
the existing multimedia file; and 

E) updating the file header of the existing multimedia file 
to describe the contents of the file body. 
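
Steps D and E of claim 28 can be illustrated with a simple in-memory model. This is a sketch under stated assumptions: the `MultimediaFile` and `Descriptor` classes and the `add_media_block` function are hypothetical, the additional media block is assumed to have been encoded and packetized already (steps A through C), and new blocks are assumed to be appended at the end of the file body.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Descriptor:
    offset: int          # starting offset of the block within the file body
    media_type: str
    encoding_type: str
    subtype: str
    rate: int


@dataclass
class MultimediaFile:
    # Hypothetical in-memory stand-in for the file header and file body.
    header: List[Descriptor] = field(default_factory=list)
    body: bytearray = field(default_factory=bytearray)


def add_media_block(f: MultimediaFile, media_type: str, encoding_type: str,
                    subtype: str, rate: int, block: bytes) -> Descriptor:
    """Append an already-packetized media block and describe it in the header."""
    # The new block starts where the current body ends.
    desc = Descriptor(len(f.body), media_type, encoding_type, subtype, rate)
    f.body += block          # step D: add the block to the file body
    f.header.append(desc)    # step E: update the header to describe the body
    return desc
```

Since offsets are relative to the file body and the new block is appended, the descriptors of the existing media blocks remain valid in this model.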

29. The method of claim 23 further comprising the step of
deleting an existing media stream from an existing multi- 
media file. 

30. The method of claim 29 comprising the steps of: 

A) deleting the existing media block from the file body of 
the existing multimedia file; and 

B) updating the file header of the existing multimedia file 
to describe the contents of the file body. 
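
The two steps of claim 30 might look like the following sketch. It assumes, for illustration only, that media blocks are stored contiguously in the file body, so that a block's length can be inferred from the next block's offset, and that later blocks are moved down to close the gap. The `Descriptor` class and the `delete_media_block` function are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Descriptor:
    offset: int       # starting offset of the block within the file body
    media_type: str
    rate: int


def delete_media_block(header: List[Descriptor], body: bytes,
                       index: int) -> Tuple[List[Descriptor], bytes]:
    """Return a new (header, body) pair with the block at header[index] removed."""
    target = header[index]
    ordered = sorted(header, key=lambda d: d.offset)
    pos = next(i for i, d in enumerate(ordered) if d is target)
    # Contiguity assumption: the block ends where the next one begins.
    end = ordered[pos + 1].offset if pos + 1 < len(ordered) else len(body)
    removed = end - target.offset

    # Step A: delete the block's bytes from the file body.
    new_body = body[:target.offset] + body[end:]

    # Step B: drop its descriptor and shift the offsets of later blocks.
    new_header = [
        Descriptor(d.offset - removed if d.offset > target.offset else d.offset,
                   d.media_type, d.rate)
        for d in header if d is not target
    ]
    return new_header, new_body
```

An implementation could instead leave the freed space unused and only remove the descriptor; the claim requires only that the header again describe the contents of the body.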

31. The multimedia file formed by the method of claim 
22. 